EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing
Abstract
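In scene parsing, the model is required to be able to process complex multi-modal data such as images and contexts in real scenes, and to discover their implicit connections from the objects existing in the scene. As a storage method that contains entity information and the relationships between entities, a knowledge graph can well express the objects and the semantic relationships between objects in the scene. In this paper, a new multi-phase process was proposed to solve scene parsing tasks: first, a knowledge graph was used to align the multi-modal information, and then the graph-based model generates results. We also designed an experiment of feature engineering's validation for a deep-learning model to preliminarily verify the effectiveness of the method. Hence, we proposed a knowledge representation method named Entity Descriptor Encoder of Transformer (EDET), which uses both the entity itself and its internal attributes for its representation. This method can be embedded into the transformer structure to solve multi-modal scene parsing tasks. EDET can aggregate the multi-modal attributes of entities, and the results in the scene graph generation and image captioning tasks prove that EDET has excellent performance in multi-modal fields. Finally, the proposed method was applied to the industrial scene, which confirmed the viability of our method.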
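The abstract says only that EDET represents an entity using both the entity itself and its internal attributes; it does not give the aggregation rule. The sketch below is one hypothetical reading, not the authors' implementation: the attribute vectors are pooled with scaled dot-product attention, using the entity vector as the query, and the pooled vector is added to the entity vector. The function name `entity_descriptor`, the residual sum, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def entity_descriptor(entity_vec, attribute_vecs):
    """Hypothetical descriptor: entity vector plus attention-pooled attributes.

    Assumption: the entity vector acts as the attention query and each
    attribute vector acts as both key and value. The source abstract does
    not specify this mechanism.
    """
    if not attribute_vecs:
        # No attributes: the descriptor falls back to the entity itself.
        return list(entity_vec)
    scale = math.sqrt(len(entity_vec))
    weights = softmax([dot(entity_vec, a) / scale for a in attribute_vecs])
    pooled = [
        sum(w * a[i] for w, a in zip(weights, attribute_vecs))
        for i in range(len(entity_vec))
    ]
    # Residual combination of the entity and its pooled attributes.
    return [e + p for e, p in zip(entity_vec, pooled)]


# Toy example: one entity with two attribute vectors (made-up values).
descriptor = entity_descriptor([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In a transformer setting, a vector like `descriptor` would stand in for the plain entity embedding as the input token; whether EDET does exactly this is not stated in the abstract.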
Similar resources
Scene Graph Parsing as Dependency Parsing
In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation (Johnson et al., 2015) that considers objects together with their attributes and relations: this representation has been proved useful across a variety of vision and language applications. We begin by introducing an alternative but equiv...
Multi-modal Variational Encoder-Decoders
Recent advances in neural variational inference have facilitated efficient training of powerful directed graphical models with continuous latent variables, such as variational autoencoders. However, these models usually assume simple, unimodal priors — such as the multivariate Gaussian distribution — yet many realworld data distributions are highly complex and multi-modal. Examples of complex a...
Multi-Modal Scene Interpretation
The visionary goal of developing an easy to use service robot implies several key tasks such as speech understanding, object recognition and scene understanding. Besides the more sensor-oriented capabilities such systems need extensive meta knowledge, e.g., about mental representations of spatial relations to match the view between man and machine. Only if all parts fit together an unrestricted...
Co-inference for Multi-modal Scene Analysis
We address the problem of understanding scenes from multiple sources of sensor data (e.g., a camera and a laser scanner) in the case where there is no one-to-one correspondence across modalities (e.g., pixels and 3-D points). This is an important scenario that frequently arises in practice not only when two different types of sensors are used, but also when the sensors are not co-located and ha...
Multi-Modal Scene Understanding for Robotic Grasping
Current robotics research is largely driven by the vision of creating an intelligent being that can perform dangerous, difficult or unpopular tasks. These can, for example, be exploring the surface of the planet Mars or the bottom of the ocean, maintaining a furnace or assembling a car. They can also be more mundane, such as cleaning an apartment or fetching groceries. This vision has been pursued sin...
Journal
Journal title: Applied Sciences
Year: 2023
ISSN: 2076-3417
DOI: https://doi.org/10.3390/app13127115